The kiva platform provides money to the small buisnesses around the world. This money is lended as a loan and not as a charity. Ideologically anything given for free losses its value.
Kiva partners with microfinance institutions, nonprofits, and other organizations to disburse loans in the communities they serve.
Kiva never takes fees from lenders or borrowers.Kiva relies on donations, grants, and fees from certain lending partners to cover operational costs.
Individual lenders don't earn interest.Kiva aims at improved financial wellbeing for borrowers
We have a data set from kiva containing details about the lended money, buisnesses it was lended, people or buisnesses that invested, so we can analyze what value addition could kiva do to these buisnesses around the world, and how much did the company impact the life of people.
- funded_amount ... mit Ablauf der "Crowding"zeit erhaltener Betrag/ ausgezahlter Betrag in USD
- loan_amount ... Zielbetrag (Betrag dem man für das Projekt erreichen wollte) in USD
- activity ... Unterkategory zu dem das Ziel des Crowdprojektes thematisch gehört
- sector ... Oberkategory in den das Crowdprojektes Thema fällt
- use ... Kurzbeschreibung wofür das Geld verwendet werden soll
- country_code ... Ländercode nach ISO Norm
- country ... Ländername nach ISO Norm
- region ... Region
- currency ... Währung in den der funded_amount dann ausgezahlt wurde
- term in months ... Dauer über die der Kredit ausgezahlt werden soll
- lender_count ...Darlehensgeber (also wieviele Personen Geld für das Projekt gegeben haben)
- borrower_genders ... Geschlecht und Anzahl der Darlehensnehmer, also diejenigen die das Crowdprojekt initiiert haben
- repayment interval ... vertraglich vereinbarte Rückzahlungsmodalitäten/-rhythmus
The data dictionary has description of every column.
##Importing necessary packages
import pandas as pd
import numpy as np
Reading the only first two entries of the data to identify the seperators.
df_input=pd.read_csv('data_abschlussprojekt.csv',sep='/n',engine='python',nrows=2)
df_input
| # funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval | |
|---|---|
| 0 | 0#300.0#300.0#Fruits & Vegetables#Food#To buy ... |
| 1 | 1#575.0#575.0#Rickshaw#Transportation#to repai... |
lets read the data into a pandas dataframe
df_kiva=pd.read_csv('data_abschlussprojekt.csv',sep='#',engine='python',skipinitialspace=True,index_col=0)
df_kiva.head()
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
Lets get info about the data frame we input from csv file.
df_kiva.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 671205 entries, 0 to 671204 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 671205 non-null float64 1 loan_amount 671205 non-null float64 2 activity 671205 non-null object 3 sector 671205 non-null object 4 use 666972 non-null object 5 country_code 671197 non-null object 6 country 671205 non-null object 7 region 614405 non-null object 8 currency 671205 non-null object 9 term_in_months 671205 non-null float64 10 lender_count 671205 non-null int64 11 borrower_genders 666984 non-null object 12 repayment_interval 671205 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 71.7+ MB
Observations
-there are 671205 rows of data, and 13 columns.
-the column 'use','country_code','region',borrower_genders' has null values
Suggestions
-the column currency has data type object that can be changed to category.
-the missing values in 'country_code' and 'country' can be filled by checking corresponding values.
Lets get more info over the data using describe
df_kiva.describe()
| funded_amount | loan_amount | term_in_months | lender_count | |
|---|---|---|---|---|
| count | 671205.000000 | 671205.000000 | 671205.000000 | 671205.000000 |
| mean | 785.995061 | 842.397107 | 13.739022 | 20.590922 |
| std | 1130.398941 | 1198.660073 | 8.598919 | 28.459551 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 |
| 25% | 250.000000 | 275.000000 | 8.000000 | 7.000000 |
| 50% | 450.000000 | 500.000000 | 13.000000 | 13.000000 |
| 75% | 900.000000 | 1000.000000 | 14.000000 | 24.000000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 |
The parmeters in describe are in the appropriate range.There is no unusual observation in the above data.
Lets check the column names for anamoly.
df_kiva.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
There are no empty spaces or special signs in the column names
Lets now analyze the unique values in all the columns.
df_kiva.nunique()
funded_amount 610 loan_amount 479 activity 163 sector 15 use 423452 country_code 86 country 87 region 12695 currency 67 term_in_months 148 lender_count 503 borrower_genders 11298 repayment_interval 4 dtype: int64
df_kiva.loc[df_kiva.duplicated()]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 100.0 | 100.0 | Home Energy | Personal Use | to buy a solar-powered lamp. | SV | El Salvador | NaN | USD | 14.0 | 4 | male | monthly |
| 606 | 100.0 | 100.0 | Home Energy | Personal Use | to buy a solar-powered lamp. | SV | El Salvador | NaN | USD | 14.0 | 4 | male | monthly |
| 808 | 450.0 | 450.0 | Higher education costs | Education | to pay for one semester's registration fees. | CO | Colombia | Bogotà | COP | 7.0 | 15 | female | monthly |
| 1703 | 500.0 | 500.0 | Higher education costs | Education | To buy a laptop for educational purposes. | SO | Somalia | Hargeisa | USD | 8.0 | 19 | male | monthly |
| 2317 | 250.0 | 250.0 | Poultry | Agriculture | to purchase poultry. | KE | Kenya | Ndaragwa | KES | 16.0 | 10 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 671201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 671202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 671203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
24588 rows × 13 columns
There are 24588 values of duplicate values.
We can select non numerical values or string values as subset to define the duplicates. So we select columns like 'activity', 'sector', 'use','country','region','borrower_gender','repayment_interval'
Lets analyse with activity, sector and use first.
df_kiva.loc[df_kiva.duplicated(subset=['activity','sector','use']),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 475.0 | 475.0 | Rickshaw | Transportation | to cover the cost of repairing rickshaw | PK | Pakistan | Lahore | PKR | 12.0 | 17 | female | irregular |
| 47 | 250.0 | 250.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for g... | IN | India | Dhupguri | INR | 43.0 | 10 | female | bullet |
| 142 | 4750.0 | 4750.0 | Personal Housing Expenses | Housing | to pay for the construction of a hygienic toil... | VN | Vietnam | Hai Duong | VND | 14.0 | 162 | female, female, female, female, female, female... | bullet |
| 154 | 1325.0 | 1325.0 | Food Production/Sales | Food | to buy shea nuts. | ML | Mali | Kadiolo | XOF | 7.0 | 50 | female, female, female, female, female | monthly |
| 199 | 2850.0 | 2850.0 | Personal Housing Expenses | Housing | to pay for the construction of a hygienic toil... | VN | Vietnam | Hai Duong | VND | 14.0 | 86 | female, female, female, female, female, female | bullet |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 671201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 671202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 671203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
229263 rows × 13 columns
There are 24588 values of duplicates in the data , and some of the rows are identical to each other. so we use drop_duplicates to deal with the inconsistency in the data.
df_kiva.drop_duplicates(inplace=True)
df_kiva
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | female | monthly |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female, female | monthly |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female, female | monthly |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female | monthly |
646617 rows × 13 columns
671205 total values
24588 duplicate values
646617 after removing duplicates
df_kiva.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 3900 country_code 8 country 0 region 56158 currency 0 term_in_months 0 lender_count 0 borrower_genders 3888 repayment_interval 0 dtype: int64
df_kiva.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
df_kiva.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object region object currency object term_in_months float64 lender_count int64 borrower_genders object repayment_interval object dtype: object
There are 8 missing values in country code , lets analyse if we can subsitute those values based on the country.
df_kiva.loc[df_kiva['country_code'].isnull(),['country_code','country']]
| country_code | country | |
|---|---|---|
| 202537 | NaN | Namibia |
| 202823 | NaN | Namibia |
| 344929 | NaN | Namibia |
| 351177 | NaN | Namibia |
| 420953 | NaN | Namibia |
| 421218 | NaN | Namibia |
| 487207 | NaN | Namibia |
| 487653 | NaN | Namibia |
The country code for Namibia as per ISO norm is NA.
https://www.iso.org/obp/ui/#iso:code:3166:NA Lets substitute NaN in country code for country Namibia as NA
df_kiva.loc[df_kiva['country_code'].isnull(),['country_code']]='NA'
###check if replacement occured
df_kiva.loc[df_kiva['country']=='Namibia',['country_code','country']]
| country_code | country | |
|---|---|---|
| 202537 | NA | Namibia |
| 202823 | NA | Namibia |
| 344929 | NA | Namibia |
| 351177 | NA | Namibia |
| 420953 | NA | Namibia |
| 421218 | NA | Namibia |
| 487207 | NA | Namibia |
| 487653 | NA | Namibia |
df_kiva['repayment_interval'].unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
borrower_array=df_kiva['borrower_genders'].unique()
borrower_array
array(['female', 'female, female', 'female, female, female', ...,
'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
dtype=object)
df_kiva.loc[df_kiva['borrower_genders'].isnull(),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 140 | 2975.0 | 2975.0 | Food Production/Sales | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 110 | NaN | monthly |
| 145 | 1200.0 | 1200.0 | Personal Expenses | Personal Use | NaN | PE | Peru | NaN | PEN | 20.0 | 44 | NaN | monthly |
| 170 | 4250.0 | 4250.0 | Catering | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 116 | NaN | monthly |
| 412 | 2350.0 | 2350.0 | Beauty Salon | Services | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 75 | NaN | monthly |
| 414 | 725.0 | 725.0 | Agriculture | Agriculture | NaN | SV | El Salvador | NaN | USD | 20.0 | 19 | NaN | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 659604 | 5625.0 | 10000.0 | Weaving | Arts | NaN | BT | Bhutan | NaN | USD | 14.0 | 210 | NaN | irregular |
| 660788 | 1975.0 | 1975.0 | Home Energy | Personal Use | NaN | PS | Palestine | NaN | USD | 27.0 | 39 | NaN | monthly |
| 661718 | 800.0 | 1600.0 | Furniture Making | Manufacturing | NaN | HT | Haiti | NaN | HTG | 13.0 | 27 | NaN | irregular |
| 671151 | 0.0 | 25.0 | Livestock | Agriculture | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 671174 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
3888 rows × 13 columns
## Function to identify the special signs in the column borrower_genders
signs=[]
for x in borrower_array:
try:
int(x)
except:
signs.append(x)
#signs
df_kiva['borrower_genders']=df_kiva['borrower_genders'].astype('string')
df_kiva['borrower_genders'].dtypes
df_kiva.loc[:,'borrower_genders']
0 female
1 female, female
2 female
3 female
4 female
...
671179 female
671181 female
671182 female, female
671184 female, female
671188 female
Name: borrower_genders, Length: 646617, dtype: string
list_count_fnm=[]
for x in df_kiva.loc[:,'borrower_genders']:
try:
borrower=x.split(',')
borrower_count=[]
count_female = 0
count_male = 0
for i in range(0,len(borrower)):
# if match found increase count
if (borrower[i] == 'female'):
count_female = count_female + 1
elif (borrower[i] =='male'):
count_male = count_male + 1
lender_count=[count_female,count_male]
except:
lender_count=[x]
list_count_fnm.append(lender_count)
#list_count_fnm
From above preprocessing we observe that the word female and male is repeated many times in each row.
Here we try to find out the no of times the words male and female are repeated in the column.
This data will be useful to find out the percentage of female borrowers and male borrowers.
list_count_fnm=[]
for x in df_kiva.loc[:,'borrower_genders']:
borrower_count=[]
try:
count_female=x.count('female')
count_male= x.count('male')
borrower_count=[count_female,count_male]
except:
borrower_count=x
list_count_fnm.append(borrower_count)
#len(list_count_fnm)
#list_count_fnm
import re
list_count_fnm=[]
for x in df_kiva.loc[:,'borrower_genders']:
borrower_count=[]
try:
count_female = len(re.findall(r'\bfemale\b', x))
count_male = len(re.findall(r'\bmale\b', x))
borrower_count=[count_female,count_male]
except:
borrower_count=x
list_count_fnm.append(borrower_count)
#list_count_fnm
#len(list_count_fnm)
we have now obtained a list of female borrowers and male borrowers , now we can split this list as female borrowers and male borrowers.
Further we can append this as a column in the main dataframe df_kiva
## female borrower count
borrower_female=[]
for i in list_count_fnm:
try:
borrower_female.append(i[0])
except:
borrower_female.append(i)
len(borrower_female)
#borrower_female
646617
## male borrower count
borrower_male=[]
for i in list_count_fnm:
try:
borrower_male.append(i[1])
except:
borrower_male.append(i)
#borrower_male
len(borrower_male)
646617
We can now add 2 columns as borrower_female and borrower_male instead of borrower_genders
df_kiva['borrower_female']=borrower_female
#df_kiva['borrower_female']=df_kiva['borrower_female'].astype(int) ### has NA type conversion doesnot happen
#df_kiva
df_kiva['borrower_male']=borrower_male
#df_kiva
df_kiva_v1=df_kiva.drop('borrower_genders',axis=1)
df_kiva_v1
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 |
646617 rows × 14 columns
df_kiva_v1.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object region object currency object term_in_months float64 lender_count int64 repayment_interval object borrower_female object borrower_male object dtype: object
df_kiva_v1['borrower_female'].min() ### output is zero , so we check corresponding value in male borrower column
0
df_kiva_v1.loc[df_kiva_v1.loc[:,'borrower_female']==0,'borrower_male']
49 1
50 5
53 1
67 1
69 1
..
671136 1
671137 1
671143 1
671145 1
671147 1
Name: borrower_male, Length: 133307, dtype: object
It has 133307 values now we will see if both male and female values are zero at the same time
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==0) & (df_kiva_v1.loc[:,'borrower_male']==0),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male |
|---|
There are no such values so we can check the OR condition now
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==0) | (df_kiva_v1.loc[:,'borrower_male']==0),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 |
604210 rows × 14 columns
Most of the values have either female or male value zero.
We can also check this by group by and aggregation
female & male 0 0 ----> zero cases
0 OR 0 ----> 604210 cases
0 & 1 ----> 129726 cases
1 & 0 ----> 415152 cases
1 & 1 ----> 2475 cases
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==0) & (df_kiva_v1.loc[:,'borrower_male']==1),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | 450.0 | 450.0 | General Store | Retail | to stock his store. | SV | El Salvador | NaN | USD | 14.0 | 18 | monthly | 0 | 1 |
| 53 | 550.0 | 550.0 | Personal Expenses | Personal Use | to buy a cradle and household items for his yo... | PE | Peru | Ica | PEN | 14.0 | 20 | monthly | 0 | 1 |
| 67 | 125.0 | 125.0 | Energy | Services | purchase solar lanterns for resale. | KE | Kenya | NaN | KES | 3.0 | 6 | irregular | 0 | 1 |
| 69 | 1075.0 | 1075.0 | Transportation | Transportation | to pay for general maintenance on his vehicle. | PE | Peru | Ica | PEN | 17.0 | 41 | monthly | 0 | 1 |
| 70 | 2000.0 | 2000.0 | Retail | Retail | to install a display window and a sunshade for... | IQ | Iraq | NaN | USD | 15.0 | 71 | monthly | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671136 | 0.0 | 350.0 | Personal Medical Expenses | Health | to buy medicine to start treatment. | TJ | Tajikistan | Qubadiyon | TJS | 14.0 | 0 | monthly | 0 | 1 |
| 671137 | 50.0 | 1150.0 | Higher education costs | Education | to pay his nephews' tuition, so that they can ... | TJ | Tajikistan | Qurgan-Tube | TJS | 14.0 | 2 | monthly | 0 | 1 |
| 671143 | 125.0 | 1175.0 | Farming | Agriculture | to purchase seeds and fertilizer for his green... | UA | Ukraine | Gladkovka | UAH | 14.0 | 5 | monthly | 0 | 1 |
| 671145 | 175.0 | 775.0 | Flowers | Agriculture | to purchase seedlings, pots for flowers and fe... | UA | Ukraine | Vinogradovo | UAH | 12.0 | 7 | monthly | 0 | 1 |
| 671147 | 0.0 | 250.0 | Food Stall | Food | to buy watermelon to start a new business. | TJ | Tajikistan | Spitamen | TJS | 14.0 | 0 | monthly | 0 | 1 |
129726 rows × 14 columns
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==1) & (df_kiva_v1.loc[:,'borrower_male']==0),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 |
| 5 | 250.0 | 250.0 | Services | Services | purchase leather for my business using ksh 20000. | KE | Kenya | NaN | KES | 4.0 | 6 | irregular | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671166 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | monthly | 1 | 0 |
| 671168 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | monthly | 1 | 0 |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 |
415152 rows × 14 columns
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==1) & (df_kiva_v1.loc[:,'borrower_male']==1),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4854 | 625.0 | 625.0 | Farming | Agriculture | to harvesting the paddy rice. | KH | Cambodia | Battambang province, Moung Russey district | KHR | 14.0 | 24 | monthly | 1 | 1 |
| 5026 | 775.0 | 775.0 | Auto Repair | Services | to buy spare parts for a mototaxi | GT | Guatemala | San Martin Jilotepeque, Chimaltenango | GTQ | 14.0 | 22 | monthly | 1 | 1 |
| 7112 | 200.0 | 200.0 | Farming | Agriculture | to purchase hybrid seeds and fertilizer to gro... | KE | Kenya | Kuria west | KES | 10.0 | 8 | bullet | 1 | 1 |
| 7906 | 100.0 | 100.0 | Home Appliances | Personal Use | To build a decent latrine for his household | KH | Cambodia | Kandal province, Kandal Stoeng district | KHR | 14.0 | 1 | monthly | 1 | 1 |
| 7942 | 725.0 | 725.0 | Retail | Retail | to expand his business on a larger scale. | KH | Cambodia | Battambang province, Sangke district | KHR | 14.0 | 15 | monthly | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 669199 | 0.0 | 1000.0 | Farming | Agriculture | to purchase rice seeds for farming | KH | Cambodia | Svay Reang Province | KHR | 20.0 | 0 | monthly | 1 | 1 |
| 669818 | 50.0 | 750.0 | Agriculture | Agriculture | to buy fertilizer, oil and seeds for growing r... | KH | Cambodia | Kratie Province, Prey Prorsorb District | KHR | 19.0 | 2 | monthly | 1 | 1 |
| 669835 | 0.0 | 1000.0 | Fishing | Food | to buy and assemble a gill net and buy a motor | CD | The Democratic Republic of the Congo | IDJWI | USD | 11.0 | 0 | monthly | 1 | 1 |
| 669903 | 25.0 | 700.0 | Fishing | Food | to purchase a canoe for his fishing business a... | CD | The Democratic Republic of the Congo | IDJWI | USD | 9.0 | 1 | monthly | 1 | 1 |
| 670860 | 0.0 | 800.0 | Farming | Agriculture | to maintain his pineapple field and to buy pin... | CD | The Democratic Republic of the Congo | IDJWI | USD | 14.0 | 0 | monthly | 1 | 1 |
2475 rows × 14 columns
#df_kiva_v1.groupby(by=['borrower_female','borrower_male']).agg(countf=('borrower_female','count'),countm=('borrower_male','count'))
There are minimum of 1 and 1 value in both female and male column.
there are 3888 missing values in both columns.
We can replace these missing values with 1 as there has to be atleast 1 borrower either its female or male, the kiva website says they have more female borrowers.
To be fair in our analysis we can replace missing values with 1 in both columns.
df_kiva_v1['borrower_female'].fillna(1, inplace=True)
df_kiva_v1.loc[df_kiva_v1.loc[:,'borrower_female'].isnull()]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male |
|---|
df_kiva_v1['borrower_male'].fillna(1, inplace=True)
df_kiva_v1.loc[df_kiva_v1.loc[:,'borrower_male'].isnull()]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male |
|---|
df_kiva_v1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 646617 entries, 0 to 671188 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646617 non-null float64 1 loan_amount 646617 non-null float64 2 activity 646617 non-null object 3 sector 646617 non-null object 4 use 642717 non-null object 5 country_code 646617 non-null object 6 country 646617 non-null object 7 region 590459 non-null object 8 currency 646617 non-null object 9 term_in_months 646617 non-null float64 10 lender_count 646617 non-null int64 11 repayment_interval 646617 non-null object 12 borrower_female 646617 non-null int64 13 borrower_male 646617 non-null int64 dtypes: float64(3), int64(3), object(8) memory usage: 74.0+ MB
df_kiva_v1.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 3900 country_code 0 country 0 region 56158 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 borrower_female 0 borrower_male 0 dtype: int64
df_kiva_v1['repayment_interval'].unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
This column has only four unique values and it has a data type object. we can convert this data type to category to save memory usauge
df_kiva_v1.loc[:,'repayment_interval'] = df_kiva_v1.loc[:,'repayment_interval'].astype('category')
df_kiva_v1.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object region object currency object term_in_months float64 lender_count int64 repayment_interval category borrower_female int64 borrower_male int64 dtype: object
The columns 'activity','sector', 'use' seem to be related.
Lets check the text input in these columns.
df_kiva_v1.iloc[:,2:5]
| activity | sector | use | |
|---|---|---|---|
| 0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. |
| 1 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... |
| 2 | Transportation | Transportation | To repair their old cycle-van and buy another ... |
| 3 | Embroidery | Arts | to purchase an embroidery machine and a variet... |
| 4 | Milk Sales | Food | to purchase one buffalo. |
| ... | ... | ... | ... |
| 671179 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... |
| 671181 | Livestock | Agriculture | Reviewed loan use in english. |
| 671182 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. |
| 671184 | Livestock | Agriculture | Translated loan use to english. |
| 671188 | Livestock | Agriculture | Reviewed loan use in english. |
646617 rows × 3 columns
#df_kiva_v1.groupby(by='activity').agg(count=('activity','count'))
df_kiva_v1['activity'].nunique()
163
df_kiva_v1['use'].nunique()
423452
df_kiva_v1.loc[df_kiva.loc[:,'use'].isnull() &
df_kiva.loc[:,'region'].isnull() , :]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 140 | 2975.0 | 2975.0 | Food Production/Sales | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 110 | monthly | 1 | 1 |
| 145 | 1200.0 | 1200.0 | Personal Expenses | Personal Use | NaN | PE | Peru | NaN | PEN | 20.0 | 44 | monthly | 1 | 1 |
| 170 | 4250.0 | 4250.0 | Catering | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 116 | monthly | 1 | 1 |
| 412 | 2350.0 | 2350.0 | Beauty Salon | Services | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 75 | monthly | 1 | 1 |
| 414 | 725.0 | 725.0 | Agriculture | Agriculture | NaN | SV | El Salvador | NaN | USD | 20.0 | 19 | monthly | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 659604 | 5625.0 | 10000.0 | Weaving | Arts | NaN | BT | Bhutan | NaN | USD | 14.0 | 210 | irregular | 1 | 1 |
| 660788 | 1975.0 | 1975.0 | Home Energy | Personal Use | NaN | PS | Palestine | NaN | USD | 27.0 | 39 | monthly | 1 | 1 |
| 661718 | 800.0 | 1600.0 | Furniture Making | Manufacturing | NaN | HT | Haiti | NaN | HTG | 13.0 | 27 | irregular | 1 | 1 |
| 671151 | 0.0 | 25.0 | Livestock | Agriculture | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | monthly | 1 | 1 |
| 671174 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | monthly | 1 | 1 |
3894 rows × 14 columns
When both use column and region column have nan values there are significant values of funded amount, loan amount for every row , so we cannot just drop these na values.
Idea 1: we can concat column activity and sector and add it to column use. then those missing values will be filled.
Idea 2: we can substitute 'Unknow' instead of missing values.
Idea 3: The column region will be useful to plot on the world map but it has lots of values missing. We can subsitute like column use with unknown.
Idea 4 : The column region can be deleted as a whole and we can just use country to plot on world map.
df_kiva_v1.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object region object currency object term_in_months float64 lender_count int64 repayment_interval category borrower_female int64 borrower_male int64 dtype: object
df_kiva_v1['use'].fillna('Unknown',inplace=True)
df_kiva_v1['region'].fillna('Unknown',inplace=True)
df_kiva_v1.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 0 country_code 0 country 0 region 0 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 borrower_female 0 borrower_male 0 dtype: int64
df_kiva_v1.describe()
| funded_amount | loan_amount | term_in_months | lender_count | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|
| count | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 |
| mean | 808.045242 | 866.458816 | 13.719826 | 21.115360 | 1.610250 | 0.419273 |
| std | 1145.419041 | 1214.276106 | 8.479912 | 28.840674 | 3.077908 | 1.133063 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 275.000000 | 275.000000 | 8.000000 | 7.000000 | 1.000000 | 0.000000 |
| 50% | 475.000000 | 500.000000 | 13.000000 | 13.000000 | 1.000000 | 0.000000 |
| 75% | 925.000000 | 1000.000000 | 14.000000 | 25.000000 | 1.000000 | 1.000000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 | 50.000000 | 44.000000 |
The columns funded amount and lender count have 0 as minimum value and loan amount has minimum of 25.
Lowest funded value is zero
df_kiva_v1.loc[(df_kiva_v1.loc[:,'funded_amount']==0) &
(df_kiva_v1.loc[:,'lender_count']==0) &
(df_kiva_v1.loc[:,'loan_amount']==25),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671151 | 0.0 | 25.0 | Livestock | Agriculture | Unknown | KE | Kenya | Unknown | KES | 13.0 | 0 | monthly | 1 | 1 |
| 671160 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | Unknown | KES | 13.0 | 0 | monthly | 1 | 0 |
| 671167 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | Unknown | KES | 13.0 | 0 | monthly | 2 | 0 |
| 671168 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | Unknown | KES | 13.0 | 0 | monthly | 1 | 0 |
| 671171 | 0.0 | 25.0 | Games | Entertainment | Kiva Coordinator replaced loan use. Should see... | KE | Kenya | Unknown | KES | 13.0 | 0 | monthly | 2 | 0 |
| 671174 | 0.0 | 25.0 | Games | Entertainment | Unknown | KE | Kenya | Unknown | KES | 13.0 | 0 | monthly | 1 | 1 |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 |
Highest funded value is 100000 USD which also has highest lender count.
df_kiva_v1.loc[(df_kiva_v1.loc[:,'funded_amount']==100000) ,:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70499 | 100000.0 | 100000.0 | Agriculture | Agriculture | create more than 300 jobs for women and farmer... | HT | Haiti | Les Cayes | USD | 75.0 | 2986 | irregular | 1 | 0 |
Distribution or total funded amount as per country
df_country=df_kiva_v1.groupby(by='country').agg(total=('funded_amount','sum'))
df_country=df_country.sort_values(by='total',ascending=False)
df_country
| total | |
|---|---|
| country | |
| Philippines | 53171450.0 |
| Kenya | 31805605.0 |
| Peru | 30386450.0 |
| Paraguay | 29401725.0 |
| El Salvador | 23345975.0 |
| ... | ... |
| Mauritania | 15000.0 |
| Afghanistan | 14000.0 |
| Vanuatu | 9250.0 |
| Guam | 395.0 |
| Virgin Islands | 0.0 |
87 rows × 1 columns
import plotly_express as px
mybar=px.bar(df_country,
x=df_country.columns,
y=df_country.index,
labels={'x':'Country','y':'Total funded amount'})
mybar.show()
df_currency=df_kiva_v1.groupby(by='currency').agg(total=('funded_amount','sum'))
df_currency
| total | |
|---|---|
| currency | |
| ALL | 2440000.0 |
| AMD | 6582800.0 |
| AZN | 2696025.0 |
| BIF | 2558550.0 |
| BOB | 16350975.0 |
| ... | ... |
| XOF | 20000575.0 |
| YER | 1784075.0 |
| ZAR | 531275.0 |
| ZMW | 1074975.0 |
| ZWD | 50.0 |
67 rows × 1 columns
df_kiva_v1['currency'].unique()
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
df_kiva_v1['country'].unique()
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
'Vanuatu', 'Panama', 'Virgin Islands',
'Saint Vincent and the Grenadines',
"Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
unique_cc = df_kiva_v1.groupby('country')['currency'].apply(pd.Series.unique)
unique_cc
country
Afghanistan [USD]
Albania [ALL, USD]
Armenia [AMD, USD]
Azerbaijan [AZN, USD]
Belize [BZD]
...
Vietnam [VND]
Virgin Islands [USD]
Yemen [YER]
Zambia [ZMW, USD]
Zimbabwe [USD, ZWD]
Name: currency, Length: 87, dtype: object
There are 87 countries and 67 currencies. USD is world wide accepted currency. Some countries are using both their own currency and USD. The values of funded_amount and loan_amount are in USD so the currency column has no impact on any other columns. As stated in data dictionary ,it represents the currency in which it is recieved by borrowers. Again it has no impact on any columns in data.
import matplotlib.pyplot as plt
import seaborn as sns
# select numerical columns
num_cols = ['funded_amount', 'loan_amount', 'term_in_months', 'lender_count','borrower_female','borrower_male']
# create subplots for each column
fig, axes = plt.subplots(nrows=1, ncols=len(num_cols), figsize=(15,5))
# create boxplot for each column in its corresponding subplot
for i, col in enumerate(num_cols):
sns.boxplot(x=df_kiva_v1[col], ax=axes[i])
axes[i].set_title(col)
plt.show()
df_kiva_v1.corr()
| funded_amount | loan_amount | term_in_months | lender_count | borrower_female | borrower_male | |
|---|---|---|---|---|---|---|
| funded_amount | 1.000000 | 0.944453 | 0.154112 | 0.847721 | 0.468677 | 0.196461 |
| loan_amount | 0.944453 | 1.000000 | 0.191142 | 0.796713 | 0.440255 | 0.207362 |
| term_in_months | 0.154112 | 0.191142 | 1.000000 | 0.233640 | -0.153909 | -0.027537 |
| lender_count | 0.847721 | 0.796713 | 0.233640 | 1.000000 | 0.262901 | 0.170445 |
| borrower_female | 0.468677 | 0.440255 | -0.153909 | 0.262901 | 1.000000 | 0.172336 |
| borrower_male | 0.196461 | 0.207362 | -0.027537 | 0.170445 | 0.172336 | 1.000000 |
The Range of data in all the columns varies widely.
We can still see a co relation that as the loan amount or funded amount has increased the lenders has also increased.
This data can be better analysed by grouping it with respect to country or sector.
We can gain insights from this data as:
-Which country has highest funded amount?
-which sector has highest funds?
-what is the percentage of female and male borrowers?
-which sector is falling short of funds?
-which sectors are the lenders willing to contribute?
-which country are the lenders willing to contribute?
-what is the average amount lended per person?
We can visualize the data with respect to following factors:
-Distribution of funds countrywise and sector wise.
-Distribution of funds geographically plotted on a map.
-Distribution of fund in males/females.
It is in the interest of our analysis to keep the extreme values.
There are funded_amount as high as 100000 USD and as low as Zero. We can thus observe that the Kiva platform is catering to a large range of buisnesses. It thus creates impact at the highest economic levels to the most remote parts of the society.
The extreme values appear to be outliers in the statistical sense but when we analyse the same data with respect to country or sector in which the investments happen, we can analyse what kind of buisnesses are success or failures, we can thus look at a larger picture.
Refering to the section 1.4.2 Column borrower_genders, it was used to extract information of no of males and no of females borrowing money for a project.
This data extraction reduces the memory usuage for the data frame, and also helps to gain insights from the data.
We can combine some values to form KPI
-payback_amount
-deficit_amount_per
-avg_lending_amount
we have data in column funded_amount and term_in_month, from this we can derive the payback_amount as
payback_amount= funded_amount/term_in_month
the minimum value in term_in_month is 1 , division by zero error does not occur.
we can compare this value with the repayment_interval column to find out how repayment happens.
The borrowers can accordingly plan the repayment of loan.
The difference between the loan amount and funded amount gives us the deficit amount
deficit_amount=loan_amount -funded_amount
We can give the deficit amount for each project and this helps to further analyse which projects did not recieve funding and what are the reasons.
The funded_amount is collective sum of many lenders. The average amount lended per person can help us analyse the lending power by individuals.
avg_lending_amount= loan_amount/lender_count
lender_count has minimum value zero , that can be substitued by the minimum loan amount
This feature can help analyse lending power of people.
repayment_amount=round((df_kiva_v1['funded_amount']/df_kiva_v1['term_in_months']),2)
repayment_amount
0 25.00
1 52.27
2 3.49
3 18.18
4 28.57
...
671179 0.00
671181 0.00
671182 0.00
671184 0.00
671188 0.00
Length: 646617, dtype: float64
df_kiva_v1.loc[:,'repayment_amount']=repayment_amount
df_kiva_v1
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 | 25.00 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 | 52.27 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 | 3.49 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 | 18.18 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 | 28.57 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 | 0.00 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 | 0.00 |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 | 0.00 |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 | 0.00 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 | 0.00 |
646617 rows × 15 columns
deficit_amount = df_kiva_v1['loan_amount'] -df_kiva_v1['funded_amount']
deficit_amount.sort_values()
type(deficit_amount)
pandas.core.series.Series
df_kiva_v1.loc[df_kiva_v1.loc[:,'funded_amount']> df_kiva_v1.loc[:,'loan_amount'],:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 277188 | 425.0 | 400.0 | General Store | Retail | to buy beverages, rice, laundry detergent, sug... | MZ | Mozambique | Boane, Maputo | MZN | 17.0 | 11 | monthly | 0 | 1 | 25.00 |
| 338159 | 3400.0 | 3000.0 | Farm Supplies | Agriculture | to pay for wires for the grape orchard, cover ... | AM | Armenia | Hoktember village, Armavir region | USD | 38.0 | 84 | monthly | 0 | 1 | 89.47 |
There are two cases where more funds were raised than the loan amount.
lets substitue this value as zero
df_kiva_v1.loc[:,'deficit_amount'] = deficit_amount
df_kiva_v1
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | deficit_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 | 25.00 | 0.0 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 | 52.27 | 0.0 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 | 3.49 | 0.0 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 | 18.18 | 0.0 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 | 28.57 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 125.0 |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 875.0 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 250.0 |
646617 rows × 16 columns
## substituting the negative values with zero .
df_kiva_v1.loc[(df_kiva_v1.loc[:,'deficit_amount'] < 0),'deficit_amount']
df_kiva_v1.loc[(df_kiva_v1['deficit_amount'] < 0),'deficit_amount']= 0
df_kiva_v1.loc[:,'deficit_amount'].min()
0.0
df_kiva_v1.loc[:,'loan_amount'].min()
25.0
We are checking when lender_count is zero than the funded amount should be zero.
df_kiva_v1.loc[(df_kiva_v1.loc[:,'lender_count']==0) &
(df_kiva_v1.loc[:,'funded_amount']==0),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | deficit_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4814 | 0.0 | 5000.0 | Food Production/Sales | Food | grow and create new jobs in my company. | US | United States | Unknown | USD | 24.0 | 0 | bullet | 1 | 0 | 0.0 | 5000.0 |
| 6784 | 0.0 | 10000.0 | Cosmetics Sales | Retail | start a new stream of income by offering retai... | US | United States | Unknown | USD | 36.0 | 0 | bullet | 0 | 1 | 0.0 | 10000.0 |
| 10735 | 0.0 | 5000.0 | Services | Services | purchase new equipment and market to more pote... | US | United States | Unknown | USD | 24.0 | 0 | bullet | 0 | 1 | 0.0 | 5000.0 |
| 12037 | 0.0 | 5000.0 | Food Production/Sales | Food | purchase much needed packaging that is sustain... | US | United States | Unknown | USD | 24.0 | 0 | bullet | 0 | 1 | 0.0 | 5000.0 |
| 13542 | 0.0 | 1500.0 | Crafts | Arts | increase my inventory of handmade candles, dri... | US | United States | Unknown | USD | 12.0 | 0 | bullet | 0 | 1 | 0.0 | 1500.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 | 0.0 | 25.0 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 | 0.0 | 25.0 |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 | 0.0 | 125.0 |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 | 0.0 | 875.0 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 | 0.0 | 250.0 |
3278 rows × 16 columns
When lender count is zero the funded amount is zero, that means the projects are not funded.
try:
avg_lending_amount = round((df_kiva_v1.loc[:,'funded_amount']/df_kiva_v1.loc[:,'lender_count']),2)
except ZeroDivisionError:
avg_lending_amount = 0
avg_lending_amount
0 25.00
1 41.07
2 25.00
3 25.00
4 25.00
...
671179 NaN
671181 NaN
671182 NaN
671184 NaN
671188 NaN
Length: 646617, dtype: float64
df_kiva_v1.loc[:,'avg_lending_amount'] = avg_lending_amount
df_kiva_v1
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | deficit_amount | avg_lending_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 | 25.00 | 0.0 | 25.00 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 | 52.27 | 0.0 | 41.07 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 | 3.49 | 0.0 | 25.00 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 | 18.18 | 0.0 | 25.00 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 | 28.57 | 0.0 | 25.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 | NaN |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 | NaN |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 125.0 | NaN |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 875.0 | NaN |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 250.0 | NaN |
646617 rows × 17 columns
df_kiva_v1.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 0 country_code 0 country 0 region 0 currency 0 term_in_months 0 lender_count 0 repayment_interval 0 borrower_female 0 borrower_male 0 repayment_amount 0 deficit_amount 0 avg_lending_amount 3278 dtype: int64
The division by zero values are substituted as Nan values, we can substitute that with zero.
df_kiva_v1.loc[:,'avg_lending_amount'].fillna(0, inplace=True)
df_kiva_v1
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | deficit_amount | avg_lending_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 | 25.00 | 0.0 | 25.00 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 | 52.27 | 0.0 | 41.07 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 | 3.49 | 0.0 | 25.00 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 | 18.18 | 0.0 | 25.00 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 | 28.57 | 0.0 | 25.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 | 0.00 |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 | 0.00 |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 125.0 | 0.00 |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 875.0 | 0.00 |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 250.0 | 0.00 |
646617 rows × 17 columns
df_kiva_v1.describe()
| funded_amount | loan_amount | term_in_months | lender_count | borrower_female | borrower_male | repayment_amount | deficit_amount | avg_lending_amount | |
|---|---|---|---|---|---|---|---|---|---|
| count | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 |
| mean | 808.045242 | 866.458816 | 13.719826 | 21.115360 | 1.610250 | 0.419273 | 72.433627 | 58.414231 | 61.655100 |
| std | 1145.419041 | 1214.276106 | 8.479912 | 28.840674 | 3.077908 | 1.133063 | 127.806460 | 399.069402 | 181.508479 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 275.000000 | 275.000000 | 8.000000 | 7.000000 | 1.000000 | 0.000000 | 23.750000 | 0.000000 | 26.830000 |
| 50% | 475.000000 | 500.000000 | 13.000000 | 13.000000 | 1.000000 | 0.000000 | 38.040000 | 0.000000 | 31.580000 |
| 75% | 925.000000 | 1000.000000 | 14.000000 | 25.000000 | 1.000000 | 1.000000 | 66.670000 | 0.000000 | 43.270000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 | 50.000000 | 44.000000 | 10000.000000 | 50000.000000 | 9475.000000 |
min_value= df_kiva_v1.loc[:,"loan_amount"].min()
min_value
max_value= df_kiva_v1.loc[:,'loan_amount'].max()
max_value
100000.0
The range of loan_amount varies from 25 USD to 100k USD , we can categorize this value into bins as high funds project to low funds projects. countries undertaking big projects can then be identified.
## categorization of loan_amount
## based on the quantile values in the descriptive information lets define categories
categories = ['Bronze','Silver','Gold','Platinum']
min_value= df_kiva_v1.loc[:,"loan_amount"].min()
max_value= df_kiva_v1.loc[:,'loan_amount'].max()
loan_range = [min_value - .001 * abs(min_value), 250, 500,1000, max_value + .001 * abs(max_value)]
df_kiva_v1["loan_category"] = pd.cut(x=df_kiva_v1.loc[:,"loan_amount"],
bins=loan_range,
labels=categories)
df_kiva_v1
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | repayment_interval | borrower_female | borrower_male | repayment_amount | deficit_amount | avg_lending_amount | loan_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | irregular | 1 | 0 | 25.00 | 0.0 | 25.00 | Silver |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | irregular | 2 | 0 | 52.27 | 0.0 | 41.07 | Gold |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | bullet | 1 | 0 | 3.49 | 0.0 | 25.00 | Bronze |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | irregular | 1 | 0 | 18.18 | 0.0 | 25.00 | Bronze |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | monthly | 1 | 0 | 28.57 | 0.0 | 25.00 | Silver |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 | 0.00 | Bronze |
| 671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 25.0 | 0.00 | Bronze |
| 671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 125.0 | 0.00 | Bronze |
| 671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | monthly | 2 | 0 | 0.00 | 875.0 | 0.00 | Gold |
| 671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | monthly | 1 | 0 | 0.00 | 250.0 | 0.00 | Bronze |
646617 rows × 18 columns
df_kiva_v1.loc[:,'loan_category'] = df_kiva_v1.loc[:,'loan_category'].astype('category')
df_kiva_v1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 646617 entries, 0 to 671188 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646617 non-null float64 1 loan_amount 646617 non-null float64 2 activity 646617 non-null object 3 sector 646617 non-null object 4 use 646617 non-null object 5 country_code 646617 non-null object 6 country 646617 non-null object 7 region 646617 non-null object 8 currency 646617 non-null object 9 term_in_months 646617 non-null float64 10 lender_count 646617 non-null int64 11 repayment_interval 646617 non-null category 12 borrower_female 646617 non-null int64 13 borrower_male 646617 non-null int64 14 repayment_amount 646617 non-null float64 15 deficit_amount 646617 non-null float64 16 avg_lending_amount 646617 non-null float64 17 loan_category 646617 non-null category dtypes: category(2), float64(6), int64(3), object(7) memory usage: 85.1+ MB
df_kiva_v1.loc[:,'sector'] = df_kiva_v1.loc[:,'sector'].astype('category')
df_kiva_v1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 646617 entries, 0 to 671188 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646617 non-null float64 1 loan_amount 646617 non-null float64 2 activity 646617 non-null object 3 sector 646617 non-null category 4 use 646617 non-null object 5 country_code 646617 non-null object 6 country 646617 non-null object 7 region 646617 non-null object 8 currency 646617 non-null object 9 term_in_months 646617 non-null float64 10 lender_count 646617 non-null int64 11 repayment_interval 646617 non-null category 12 borrower_female 646617 non-null int64 13 borrower_male 646617 non-null int64 14 repayment_amount 646617 non-null float64 15 deficit_amount 646617 non-null float64 16 avg_lending_amount 646617 non-null float64 17 loan_category 646617 non-null category dtypes: category(3), float64(6), int64(3), object(6) memory usage: 80.8+ MB
## creating an optimal copy of dataframe
df_kiva_v1_opti=df_kiva_v1.copy()
##Lets check memory usage of data frame
df_kiva_v1_opti.memory_usage(deep=True)
Index 5172936 funded_amount 5172936 loan_amount 5172936 activity 44762744 sector 648164 use 76085559 country_code 38150403 country 42321602 region 46198305 currency 38797020 term_in_months 5172936 lender_count 5172936 repayment_interval 647045 borrower_female 5172936 borrower_male 5172936 repayment_amount 5172936 deficit_amount 5172936 avg_lending_amount 5172936 loan_category 647041 dtype: int64
The columns 'use' has text input, we can analyse this data in a different method , for the current analysis we do not need this column. we can drop this column to make the data frame optimal.
Similarly the value of currency is not used in the analysis , we can delete this column.
The column region has 51k+ unknown values, we can use this in plotting in graph on world map.
The column country code can be used instead of country in graphs where it is difficult to show the name in full.
The column acitivity can be used in detail analysis of which sectors and their subsets are recieving more funding
##Lets drop the unused column
df_kiva_v1_opti = df_kiva_v1_opti.drop(columns=['use','currency'])
df_kiva_v1_opti.memory_usage(deep=True)
Index 5172936 funded_amount 5172936 loan_amount 5172936 activity 44762744 sector 648164 country_code 38150403 country 42321602 region 46198305 term_in_months 5172936 lender_count 5172936 repayment_interval 647045 borrower_female 5172936 borrower_male 5172936 repayment_amount 5172936 deficit_amount 5172936 avg_lending_amount 5172936 loan_category 647041 dtype: int64
df_kiva_v1_opti.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 646617 entries, 0 to 671188 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646617 non-null float64 1 loan_amount 646617 non-null float64 2 activity 646617 non-null object 3 sector 646617 non-null category 4 country_code 646617 non-null object 5 country 646617 non-null object 6 region 646617 non-null object 7 term_in_months 646617 non-null float64 8 lender_count 646617 non-null int64 9 repayment_interval 646617 non-null category 10 borrower_female 646617 non-null int64 11 borrower_male 646617 non-null int64 12 repayment_amount 646617 non-null float64 13 deficit_amount 646617 non-null float64 14 avg_lending_amount 646617 non-null float64 15 loan_category 646617 non-null category dtypes: category(3), float64(6), int64(3), object(4) memory usage: 70.9+ MB
df_kiva_v1.describe()
| funded_amount | loan_amount | term_in_months | lender_count | borrower_female | borrower_male | repayment_amount | deficit_amount | avg_lending_amount | |
|---|---|---|---|---|---|---|---|---|---|
| count | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 | 646617.000000 |
| mean | 808.045242 | 866.458816 | 13.719826 | 21.115360 | 1.610250 | 0.419273 | 72.433627 | 58.414231 | 61.655100 |
| std | 1145.419041 | 1214.276106 | 8.479912 | 28.840674 | 3.077908 | 1.133063 | 127.806460 | 399.069402 | 181.508479 |
| min | 0.000000 | 25.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 275.000000 | 275.000000 | 8.000000 | 7.000000 | 1.000000 | 0.000000 | 23.750000 | 0.000000 | 26.830000 |
| 50% | 475.000000 | 500.000000 | 13.000000 | 13.000000 | 1.000000 | 0.000000 | 38.040000 | 0.000000 | 31.580000 |
| 75% | 925.000000 | 1000.000000 | 14.000000 | 25.000000 | 1.000000 | 1.000000 | 66.670000 | 0.000000 | 43.270000 |
| max | 100000.000000 | 100000.000000 | 158.000000 | 2986.000000 | 50.000000 | 44.000000 | 10000.000000 | 50000.000000 | 9475.000000 |
### save the optimized data frame to a csv file.
df_kiva_v1_opti.to_csv('kiva_v1_opti.csv',sep=',')
df_kiva_v1_opti.to_pickle('kiva_v1_after_preprocessing.pkl')